Measuring Goal Setting in School-Aged Children: Studying the Effects of Demographic Variables in Regression-Based Norms

The aim of the study was to establish demographically representative norms for tasks measuring goal setting, and more specifically planning and reasoning, in children. Three tasks were administered to n = 195 Ukrainian children aged 5.10 to 14.5 years old: the Spatial Working Memory (SWM) test, the Stockings of Cambridge (SOC) test, and the Naglieri Nonverbal Ability Test (NNAT). The main outcome per test was accuracy, i.e., the total number correct for the SOC and NNAT and the total number of incorrect responses for the SWM. Correlations among accuracy measures varied from −0.51 to 0.60, indicating that these tasks measure related but distinct constructs. Higher age was associated with more accurate test performance on all outcome measures. On the NNAT, we found a curvilinear association between age and accuracy, indicating that younger children’s NNAT accuracy scores increased more with age than those of older children. We found a cubic age effect on accuracy for the SWM and SOC, i.e., test scores were relatively stable at younger and older ages, with a curvilinear increase in test scores in the other age groups. Demographically corrected norms were calculated and presented per test. Sex was not associated with accuracy scores on any of the tests. Last, a higher level of parental education (LPE) was associated with higher accuracy scores, but only on the NNAT. We conclude that including demographic variables in norm analyses enhances insight into the scores and allows for application in clinical settings and research.


Introduction
Goal setting is an important skill for daily life. Anderson (2002, p. 74) defined goal setting as "the ability to develop new concepts, as well as the capacity to plan actions in advance and approach tasks in an efficient and strategic manner." Goal setting includes, for example, the ability to derive new notions (i.e., reasoning), to plan the steps needed to achieve end goals, and to apply this plan efficiently. As such, deficits in goal setting may lead to inadequate planning, inadequate execution of plans, and limited conceptualizing of novel ideas while, e.g., trying to solve a problem. Instruments to measure goal setting in children are relatively scarce and suitable norm data are limited. However, interpretation of the test scores of a specific sample is only meaningful when they are compared with suitable norm data.
The current study aims to establish such norms for a sample of Ukrainian schoolchildren for three tests measuring goal setting: namely, the Spatial Working Memory (SWM) test and the Stockings of Cambridge (SOC), which are both subtests from the Cambridge Neuropsychological Test Automated Battery (CANTAB©; Cambridge Cognition 2012) and the Naglieri Nonverbal Ability Test (NNAT; Naglieri 2003), which are described in detail below. We employ a relatively novel and promising method for norming, which is called continuous norming (Van Breukelen and Vlaeyen 2005;Van der Elst et al. 2011). This method includes demographic variables of the sample to evaluate their influence on the scores and to increase the meaningfulness of the norms (Heaton et al. 2003).
We will now briefly describe the tests, which were all nonverbal and, apart from the NNAT, computer administered. This procedure also allowed testing of illiterate or low-literate children. In the SWM task, participants see boxes on the screen, and the task is to find tokens hidden behind some of the boxes as fast as possible (see Fig. 1 for an example). A box can be opened by touching it. Applying a process of elimination enhances token search efficiency, and the task is as such believed to measure planning ability.
The SOC is a computerized adaptation of the Tower of London task (Ozonoff et al. 2004;Shallice 1982). The SOC measures the ability to plan and to execute a multi-step solution to a problem. Participants see colored circles ("balls") in the bottom half of the screen that need to be moved to "stockings" until a pattern is created that is identical to the example pattern displayed in the top half of the screen (see Fig. 2 for an example). The trials increase in level of difficulty by increasing the number of moves needed to solve a problem.
Lastly, the NNAT consists of pattern matrices. Participants need to identify the correct missing part from five response options shown in order to complete the target drawing (see Fig. 3 for examples). This test is believed to measure reasoning, which falls within the functional domain of goal setting (Anderson 2002;Giessman et al. 2013;Naglieri and Ford 2003).
Within the domain of goal setting, the three tasks can be conceptualized as well-defined problems, i.e., with a clear start and final state (as opposed to poorly defined problems; Eysenck and Brysbaert 2018), which the child needs to solve. Reasoning in solving a problem is thought to be enhanced if the nature of the problem, the start and end states, and the methods for a resolution are clear from the start. This well-defined analytical problem solving is thought to differ from creative problem solving (Lavric et al. 2000) in that the representations require relatively little novel conceptualization (Knoblich et al. 2001). Well-defined problems, in contrast to creative problem solving (which may be required in poorly defined problems), may incorporate analogies and/or previous experience with comparable problems. The demands that reasoning, planning, and executing place on working memory in these tasks are thus predominantly associated with visual spatial working memory and other executive functions such as planning, error monitoring, cognitive flexibility in adapting strategies after an error, inhibition of prepotent (incorrect) steps in, e.g., the SOC, and sustaining attention as trial duration increases with difficulty level (Anderson 2002; Lavric et al. 2000; Mammarella et al. 2008).
In pediatric populations, norm data available for the tests discussed here have, to the best of our knowledge, been calculated in a traditional way (De Luca et al. 2003; Lehto et al. 2003; Luciana and Nelson 2002; Roque et al. 2011). Traditional calculations tend to be based on, e.g., 2-year age categories that are created in a sample, which requires large sample sizes. Another disadvantage of such subgroups is that the sample characteristics of a specific age group may be irregular (e.g., scoring higher than expected for an age bracket), yet become the norm (Van der Elst et al. 2011). Also, children of virtually the same age may be in different age groups, while children who differ by almost 2 years may be in the same age category: A child of 12.9 is in the 11 to 12 years old age bracket, whereas a child of 13.1 is in the 13 to 14 years old age group (Van der Elst et al. 2011). To avoid these drawbacks of traditional norming, a method called continuous norming has been introduced, which will be applied in this study to collect SWM, SOC, and NNAT normative data for schoolchildren (Van Breukelen and Vlaeyen 2005; Van der Elst et al. 2011).
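The bracket problem described above can be illustrated with a minimal sketch. The function name and the choice of brackets starting at age 5 are assumptions made for illustration only; the bracket labels follow the example in the text.

```python
import math

def two_year_bracket(age_years, start=5, width=2):
    """Return the (low, high) whole-year labels of the bracket containing age_years.

    Hypothetical binning: brackets start at `start` and span `width` years,
    so with start=5 the brackets are 5-6, 7-8, 9-10, 11-12, 13-14.
    """
    index = math.floor((age_years - start) / width)
    low = start + index * width
    return (low, low + width - 1)

# Children 0.2 years apart land in different brackets ...
print(two_year_bracket(12.9))  # (11, 12)
print(two_year_bracket(13.1))  # (13, 14)
# ... while children 1.9 years apart share one bracket.
print(two_year_bracket(11.0))  # (11, 12)
```

Treating age as a continuous predictor removes these arbitrary cut points.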
Continuous norming is arrived at by conducting multiple regression analyses that include several demographic variables, such as age, sex, and level of parental education (LPE; as a proxy for a child's socio-economic position). These demographic variables have been linked to differences in neuropsychological test performance, e.g., lower scores in working memory tasks have been associated with lower socioeconomic status (Lange et al. 2006; Strauss et al. 2006; Tallberg et al. 2008). For example, age-related development has been associated with improved goal setting abilities from 4 to 12 years old (Green et al. 2019; Luciana and Nelson 2002; Roque et al. 2011) and, in pediatric populations, the validity of the SWM, SOC, and NNAT was established in schoolchildren from 5 years and older (Green et al. 2019; Carman and Taylor 2010; Luciana and Nelson 2002; De Luca et al. 2003; Lehto et al. 2003; Rojahn and Naglieri 2006).
Fig. 1 SWM on a touch screen: simultaneous view of closed boxes, one open box with a token and the column on the right with a previously found token (this is not a real CANTAB© SWM sample)
Fig. 2 The SOC on a touch screen: a two-move problem. The balls in the bottom "stockings" need to be dragged and moved until they are exactly in the same position as the top half of the screen (this is not a real CANTAB© SOC example)
Data on the influence of children's biological sex are still inconclusive with regard to goal setting abilities in children. For instance, Pangelinan et al. (2011) did not find sex differences on the SWM in their sample of 172 children aged 6-13 years old, and Rojahn and Naglieri (2006) did not find sex differences on the NNAT in children from 6 to 9 years old. In 10-13-year-old children however, females scored slightly higher on the NNAT and there was a slight sex difference in favor of males in 15-16-year-old children. However, these authors deemed the effects too small to apply to norm scores.
Additionally, the demographic variable LPE has not been studied often in relation to goal setting abilities in schoolchildren. LPE is perceived as an estimate of socioeconomic status (SES; White 1982). It has been found to correlate positively with children's academic achievement (Hackman and Farah 2009) and cognitive functions such as working memory (Mezzacappa 2004; Naglieri and Ronning 2000a; Noble et al. 2007). These findings underline the relevance of studying the influence of age, sex, and LPE on the reasoning and planning tasks (in the domain of goal setting) in this study and of including demographic variables in the analyses that establish norms. However, the number of studies examining the validity of goal setting tasks such as the SWM and the SOC in pediatric populations is still limited, and these studies often do not include multiple demographic variables. In the current study, our aim is to fill part of the gaps in the existing literature on the influence of demographic variables on goal setting tasks. In the statistical method of regression analyses (Van der Elst et al. 2011), the categorical variables sex (male/female) and LPE (high/low) and the continuous variable age are simultaneously included. This means that age as a continuous variable will be applied to each data point in the sample (Bechger et al. 2009). Both the option of including several demographic variables in the analyses and the application of each variable to all data (and not to a specific subcategory, e.g., a 2-year age group) make these regressions more adequate and individualized (Van der Elst et al. 2011; Toornstra et al. 2020). Further, the second aim of the research concerns the current sample, which consists of schoolchildren in Ukraine and therefore differs from the mostly Western norm samples of the respective tests.
Non-verbal tests are thought to be suitable for measurement in different populations but may still not be completely culture free (e.g., in their design), thus supporting the need to establish the normal range for this sample (Rosselli and Ardila 2003). Norms for these measures have, to the best of our knowledge, not been established for Ukraine and hence are needed in order to interpret scores in a meaningful manner (Lange et al. 2006). The aim of the present study was to assess the effect of demographic variables and establish a normative data range on the SWM, SOC, and NNAT, for a sample of n = 195 schoolchildren in Ukraine, aged 5.10 to 14.5 years old.

Participants
The sample of n = 195 children was enrolled in local schools in Ukraine, which provides 9 years of compulsory education, in general until children are 15 years old (Table 1). Age was approximately evenly distributed except for the youngest and oldest age groups. Per school, all children were invited and 80.5% participated. The LPE was based on the level of education completed by the parent(s) in a household: low described education up until grade 9 (similar to primary and middle schools) and high applied to grade 10 (high school or polytechnic) and above, according to the Ukrainian educational system (Ukraine Channel, 2017). Education is compulsory up until and including grade 9. In the 6.5% of cases in which the parents' (or caregivers') levels of education differed, the highest level was assigned to the household (in line with, e.g., Koball and Jiang 2018). Please see Table 1 for an overview of all background information for the current sample.

Procedure and Instrument
Researchers contacted schools to partake in the study, after which responding school managements invited parents to information meetings about this study. Information and consent letters were provided by the researchers and the school management and then taken home. Testing started after parents and children had given informed consent letters to their class teachers.
The SWM and SOC, as part of the Cambridge Neuropsychological Test Automated Battery (CANTAB©, Cambridge Cognition 2012), and the NNAT were administered individually in a separate room. CANTAB©-trained testers administered the SOC and SWM on a HP Pavilion TS sleek-book laptop with a 15-in. touch screen. The paper and pencil NNAT was administered on a different day within the same weeks with the provided trial booklet. Two trained researchers administered tests. Instructions followed the prescribed scripts and practice items in the school's language of instruction. All tests were terminated according to their respective termination rules (see "Materials"). The research ethics committee of the Faculty of Psychology and Neuroscience of Maastricht University, The Netherlands, approved this study. All data were obtained in compliance with ethics regulations of the WMA declaration of Helsinki (2013). School management received debriefing by anonymous group reports per class.
Table 3 Correlations between the SWM, SOC, and NNAT outcome measures
Measure                                   1         2         3
1 SWM total errors                        1
2 SOC problems solved in minimum moves    −0.53**   1
3 NNAT                                    −0.51**   0.60**    1
** Correlation is significant at the 0.01 level (2-tailed)

Materials
First, SWM is a computerized self-ordered search task, which requires planning (Cambridge Cognition 2012). Lowe and Rabbitt (1998) found a moderate reliability (0.6-0.68) on the CANTAB© SWM. In this task, blue tokens need to be found in boxes that appear on the screen. Once a token has been found, this token is placed in an empty column on the right side of the screen (Fig. 1). Tokens are hidden according to pre-set rules unknown to the participant: The box where a token was found will not have a token again for the duration of this particular trial. Once a token has been found, a continued search from that box working in a systematic way is deemed most effective (Owen et al. 1990). Each trial starts with a similar screen with boxes and the trials increase in difficulty by increasing the number of tokens that need to be found to a maximum of 8 (thus increasing the chance of errors as more boxes need to be remembered that contained a token). The main outcome measure of our study is the number of errors (SWM total errors); i.e., fewer errors are thought to reflect an efficient search strategy and are therefore seen as a measure of accuracy (Cambridge Cognition 2012). The break-off point for the whole test is based on too many total errors. Total administration time is on average 10 min.
The SOC (Cambridge Cognition 2012) is a computerized task of planning and reasoning. Lowe and Rabbitt (1998) found a moderate reliability (0.6-0.68) on an older version of the CANTAB© SOC.
The SOC test procedure is as follows: Two displays with three colored balls each are shown on a screen. The balls in the lower half of the screen need to be moved until they form an exact copy of the example pattern in the upper part of the screen (see Fig. 2). The balls can only be moved according to specific rules, e.g., only one ball can be moved at a time, and a ball at the bottom cannot be moved if there is one on top. The number of moves is seen as a measure of the ability to plan, with fewer moves indicating better planning ability. The problems increase in difficulty with respect to the minimum number of moves, from 2 to 5, needed to copy the image at the top of the screen. Each of these trials starts with a new screen. The outcome measure "Problems solved in minimum moves" reflects the number of times a trial was successfully completed using the minimum number of moves. The best possible score is 12 (i.e., 12 problems solved using the minimum number of moves). The test break-off point is reached when three consecutive problems are registered as having too many moves (the exact number depends on the problem, e.g., 7 moves for a problem that can be solved in 3 moves). Total administration time is on average 13 min.
Last, the NNAT is an untimed one-on-one task on paper (Naglieri 2003). It purports to measure reasoning. The test consists of four different forms of reasoning in increasingly difficult trials and thus involves cognitive flexibility and goal setting (see Fig. 3). Evidence of the test's validity for children was found in correlations of the group version of the NNAT with well-established tests, such as the Raven's Progressive Matrices (RPM) test, as well as in its reliability from 5 years old onward (Naglieri 2003; Naglieri and Ronning 2000b; Raven 2000). The procedure is as follows: One page shows a two-colored matrix with a part left blank. The correct missing part needs to be chosen from 5 response options underneath (see Fig. 3). The outcome measure is the sum of correct responses (maximum possible is 72). Total administration time of this untimed paper test depends on the individual student; the break-off point is four consecutive incorrect trials.
Fig. 4 Trajectory of the regression-based scores for the SWM, the SOC, and the NNAT (high LPE) accuracy scores by age

Statistical Analyses
First, means and standard deviations are calculated for accuracy scores of all tests (see Table 2). Then, the observed means of the Ukrainian schoolchildren are compared with the observed means of the Western norm samples for the SWM, the SOC, and the NNAT respectively by evaluating the overlap between the confidence intervals (CIs) of the two samples. This comparison is based on the age subgroups of the Western norm samples that match those of the Ukrainian sample, which means that the oldest and youngest children are excluded. Next, Pearson zero-order correlations are calculated between the Ukrainian SWM, SOC, and NNAT observed scores (Table 3). Regression models are established for each outcome measure of the SWM, SOC, and NNAT, with the demographic variables age, sex, and LPE, and their two-way interactions (e.g., age × LPE). Sex was dummy coded as 0 = female and 1 = male. LPE was dummy coded as 0 = low and 1 = high. Age is centered ($\mathrm{Age}_C$ = calendar age in months − mean age of the sample, 116.44 months), and quadratic ($\mathrm{Age}_C^2$) and cubic ($\mathrm{Age}_C^3$) age terms are included to evaluate curvilinear effects, e.g., scores increasing less with age in older children (Van der Elst et al. 2011). Plotted observed scores and the predicted scores are visually inspected for potentially more complex curvilinear patterns. Fractional polynomial (regression) models with exponents are used to achieve the best possible fit capturing complex curvilinearity in the data. The demographic variables (age, sex, LPE) and their two-way interactions are included (Van der Elst et al. 2016). A restricted set of 7 exponents {−2, −1, −0.5, 0, 0.5, 1, 2} is used, which is typically adequate to capture complex relationships between variables (Van der Elst et al. 2016). We establish which demographic variables are predictive for each of the outcome measures of the SWM, SOC, and NNAT.
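As a rough illustration of the predictor coding described above, the following sketch builds the full set of candidate predictors for one child. The function and key names are hypothetical; only the dummy coding and the sample mean age of 116.44 months come from the text.

```python
MEAN_AGE_MONTHS = 116.44  # sample mean age reported in the text

def design_row(age_months, sex, lpe):
    """Build the candidate predictor values for one child.

    sex: 0 = female, 1 = male; lpe: 0 = low, 1 = high (coding from the text).
    The dictionary keys are illustrative names, not the authors' variable names.
    """
    age_c = age_months - MEAN_AGE_MONTHS  # centered age
    return {
        "age_c": age_c,
        "age_c2": age_c ** 2,       # quadratic age term
        "age_c3": age_c ** 3,       # cubic age term
        "sex": sex,
        "lpe": lpe,
        "age_c_x_sex": age_c * sex,  # two-way interactions
        "age_c_x_lpe": age_c * lpe,
        "sex_x_lpe": sex * lpe,
    }

row = design_row(age_months=132, sex=1, lpe=1)
print(round(row["age_c"], 2))   # 15.56
print(round(row["age_c2"], 2))  # 242.11
```

Centering age before squaring and cubing it keeps the polynomial terms less strongly correlated with each other than raw age powers would be.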
The full models are reduced in a stepwise hierarchical way by removing the least significant predictor if its p value is higher than 0.05, and Akaike's Information Criterion (AIC) is used to identify the two best powers for the models (i.e., the lower the AIC, the better the fit).
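The AIC-based choice of two powers can be sketched as follows. This is a simplified illustration, not the authors' SPSS/R procedure: it covers only the power selection (not the p-value-based removal of predictors), uses a single predictor, applies the usual fractional polynomial conventions (power 0 means ln x; a repeated power adds a term multiplied by ln x), and computes a Gaussian AIC up to an additive constant. The data are made up.

```python
import math
from itertools import combinations_with_replacement

POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2)  # restricted exponent set from the text

def fp_term(x, p):
    """Fractional polynomial term; power 0 denotes ln(x) by convention."""
    return math.log(x) if p == 0 else x ** p

def fp2_row(x, p1, p2):
    """Intercept plus two FP terms; a repeated power uses x^p * ln(x)."""
    t1 = fp_term(x, p1)
    t2 = t1 * math.log(x) if p1 == p2 else fp_term(x, p2)
    return [1.0, t1, t2]

def solve(a, b):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ols_rss(rows, y):
    """Residual sum of squares of an ordinary least squares fit."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))

def aic(rss, n, k):
    """Gaussian AIC up to an additive constant: n * ln(RSS / n) + 2k."""
    return n * math.log(rss / n) + 2 * k

def best_fp2(x, y):
    """Return (lowest AIC, (p1, p2)) over all pairs of powers."""
    best = None
    for p1, p2 in combinations_with_replacement(POWERS, 2):
        rows = [fp2_row(xi, p1, p2) for xi in x]
        score = aic(ols_rss(rows, y), len(y), k=3)
        if best is None or score < best[0]:
            best = (score, (p1, p2))
    return best

ages_years = [6, 7, 8, 9, 10, 11, 12, 13, 14]  # hypothetical ages (must be > 0)
scores = [4, 6, 9, 11, 12, 14, 14, 15, 15]     # hypothetical accuracy scores
best_aic, best_powers = best_fp2(ages_years, scores)
print(best_powers, round(best_aic, 2))
```

Because every candidate here has the same number of coefficients, the lowest AIC coincides with the lowest residual sum of squares; the penalty term only matters when models of different size are compared.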
Next, the assumptions of regression analysis are evaluated for each model, that is, the normal distribution of residuals and the absence of heteroscedasticity, multicollinearity, and influential cases (Field 2009; Van der Elst et al. 2006). The Kolmogorov-Smirnov test is applied to check the normal distribution of the residuals. The occurrence of multicollinearity is tested by calculating the Variance Inflation Factors (VIF; values ≤ 10 are considered acceptable), while Cook's distance and standardized residuals with an absolute value > 3 are used to identify influential cases (Fisher et al. 2014). The residuals are plotted and visually inspected. Last, the standardized residuals are grouped in quartiles based on the predicted scores. Levene's test is used to evaluate heteroscedasticity in the standardized residuals based on these quartile groups using p = 0.05.
In the last step, the normative data are calculated using the expected scores for each participant based on the final models. The residuals are calculated (residual = observed score − expected score) and then standardized using the standard deviation (SD) of the residuals of the normative sample (standardized residual = residual / SD(residual); Van der Elst et al. 2011). Last, the standardized residuals are converted to percentile values based on the normal distribution if the normality assumption (as evaluated with the Kolmogorov-Smirnov test) is met. If this normality assumption is not met, the empirical distribution of the standardized residuals of the sample is used. Making these calculations per child is tedious in practice, so tables with these converted scores are provided in the Appendix for ease of use. An alpha level of 0.05 is applied to limit type I errors. All calculations are made in SPSS version 24 and R.
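The conversion steps above can be sketched in a few lines. The function names and the small list of standardized residuals are hypothetical; the sketch only assumes the two routes named in the text, namely the standard normal distribution when normality holds and the empirical distribution of the sample's standardized residuals otherwise.

```python
from statistics import NormalDist

def standardized_residual(observed, expected, sd_residual):
    """(observed - expected) divided by the SD of the normative residuals."""
    return (observed - expected) / sd_residual

def percentile_normal(z):
    """Percentile from the standard normal cumulative distribution."""
    return 100 * NormalDist().cdf(z)

def percentile_empirical(z, sample_z):
    """Percentage of the sample's standardized residuals that are <= z."""
    return 100 * sum(1 for s in sample_z if s <= z) / len(sample_z)

# A child scoring exactly as predicted sits at the 50th percentile.
print(percentile_normal(standardized_residual(30, 30, 5)))  # 50.0

# Empirical route with made-up standardized residuals.
sample_z = [-1.5, -0.5, 0.0, 0.4, 2.0]
print(percentile_empirical(0.1, sample_z))  # 60.0
```

For an error score such as SWM total errors, where lower is better, the direction of the percentile would need to be interpreted accordingly.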

Results
The means and SD for the SWM, the SOC, and the NNAT outcome measures (observed untransformed scores of the Ukrainian sample) are presented in Table 2. The tests were part of a test battery and especially younger children started to drop out in the latter part of this battery consisting of the SWM and the SOC (i.e., not starting the test), while others dropped out due to absenteeism and/or illness (SWM n = 169 and SOC n = 152 children).
The results of the correlation analysis between the SWM, SOC, and NNAT outcome measures are presented in Table 3. All accuracy measures correlated significantly with each other in the expected direction, meaning that higher accuracy on one is linked to higher accuracy on the other (i.e., lower is better on SWM errors, hence the negative correlations with NNAT and SOC problems solved in minimum moves scores).
Next, the final regression models (p < 0.01) are shown in Table 4. The models for the NNAT, the SWM, and the SOC are all curvilinear. Box-Cox transformations were applied to SWM total errors and SOC scores, because preliminary analyses suggested heteroscedasticity with the untransformed scores (Osborne 2010). Also, outliers (i.e., 2 for the NNAT), defined as scores with an absolute standardized residual > 3.0, were removed. After these transformations and the removal of outliers, the assumptions of regression analysis were met for the final models: Kolmogorov-Smirnov p values were ≥ 0.05, except for SWM total errors (p < 0.001); all p values of Levene's statistic were ≥ 0.05; Cook's distance values were ≤ 0.01; and all Variance Inflation Factors were ≤ 1.02 (except for SWM total errors, VIF 6.4, which is below the cutoff of 10 that is indicative of multicollinearity). There were only significant main effects and no significant interaction effects. Because the SWM total errors did not meet the normality assumption, transformations of the standardized scores to percentiles were based on the empirical distribution of the standardized residuals (Zhou 1998).
None of the participants obtained the maximum or lowest possible score (lowest, i.e., 0 errors on the SWM) on any of the outcome measures. The best scores achieved were 11 (out of 12) on the SOC by 2%, 2 errors on the SWM by 0.6%, and 48 (out of 72) on the NNAT by 0.5% of the participants (no ceiling effects). Last, the CIs (p = 0.01) for the grand mean scores showed overlap for the SWM (CI Ukrainian sample ranges between 40.82 and 56.92, and 18.96 and 47.99 for the CANTAB norm group; Cambridge Cognition 2014). The CIs of the grand mean scores showed minimal overlap between the Ukrainian sample and the CANTAB norm group on the SOC (CI 3.99-6.55 and 6.06-8.75 respectively). There was no overlap between the CIs of the grand mean NNAT scores (CI Ukrainian children 15.  in the NNAT norm sample; Naglieri 2003), so the score ranges differed between the two samples. These differences highlight the importance of having normative data for Ukrainian children.
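The descriptive CI comparison can be made concrete with a small sketch that measures how much two intervals overlap. The function name is hypothetical; the interval bounds are the SWM and SOC values quoted above. The NNAT intervals are left out because their bounds are not fully given here.

```python
def overlap_width(ci_a, ci_b):
    """Width of the overlap between two (low, high) intervals; 0.0 if none."""
    low = max(ci_a[0], ci_b[0])
    high = min(ci_a[1], ci_b[1])
    return max(0.0, high - low)

# (Ukrainian sample, CANTAB norm group), bounds as reported in the text
swm = overlap_width((40.82, 56.92), (18.96, 47.99))
soc = overlap_width((3.99, 6.55), (6.06, 8.75))

print(round(swm, 2))  # 7.17
print(round(soc, 2))  # 0.49
```

An overlap width of zero would correspond to the "no overlap" case reported for the NNAT.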

Normative Procedure
Based on the established regression models, norms for the SWM, the SOC, and the NNAT outcome measures are calculated for the Ukrainian sample by applying the procedure described above (norm tables in Appendix). Box-Cox transformations were reverted back to untransformed scores (Osborne 2010). For example, presume that an 11-year-old child completed 20 items correctly on the NNAT and the level of education of the child's parents is high. The regression model for the NNAT is $\text{expected score} = \text{Constant} + B_1 \cdot \mathrm{Age}_C + B_2 \cdot \mathrm{Age}_C^2 + B_3 \cdot \mathrm{LPE}$, with the coefficients presented in Table 4. This is Constant + 0.16 × (age of the child in months − average age of the sample) + (−0.002 × $\mathrm{Age}_C^2$) + (7.53 × LPE), that is, $14.96 + [0.16 \times (132 - 116.44)] + (-0.002 \times 242.11) + (7.53 \times 1) = 24.5$. In the second step, the residual is computed, which is 20 − 24.5 = −4.5. In the third step, the residual is standardized by dividing it by the SD of the residuals: −4.5/6.61 = −0.68. The standardized residual is then turned into a percentile based on the standard normal cumulative distribution. A standardized residual of −0.68 corresponds with a percentile value equal to 23. This means that 23% of the population of 11-year-old children with high LPE obtain an NNAT score that is equal to or lower than 20. The NNAT score of this child is therefore in the lower part of the normal range (5th-95th percentile).
Lastly, the NNAT outcome measure showed a quadratic age effect, which indicates a curvilinear development with age. The SWM and SOC scores showed cubic age effects, that is, a trajectory with multiple curves. The cross-sectional developmental trajectories of all scores are shown in Fig. 4. The NNAT and SOC show in principle increasing accuracy between the ages of 5.5 and 13.5 years old and then start to level off. Figure 4 also shows the development of scores with age for the outcome measure SWM total errors, in which a lower score is better. Several bends in the regression line are typical for a cubic age effect, as in SWM total errors, which appears to stabilize from 13 years old.
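The arithmetic of the worked NNAT example can be retraced with a short sketch. It uses only the rounded coefficients quoted in the example (constant 14.96, 0.16, −0.002, 7.53, residual SD 6.61); the function and constant names are hypothetical.

```python
from statistics import NormalDist

# Rounded values quoted in the worked example in the text
CONSTANT, B_AGE, B_AGE2, B_LPE = 14.96, 0.16, -0.002, 7.53
MEAN_AGE_MONTHS = 116.44
SD_RESIDUAL = 6.61

def expected_nnat(age_months, lpe):
    """Expected NNAT score from centered age, its square, and LPE (0/1)."""
    age_c = age_months - MEAN_AGE_MONTHS
    return CONSTANT + B_AGE * age_c + B_AGE2 * age_c ** 2 + B_LPE * lpe

expected = expected_nnat(age_months=132, lpe=1)
residual = 20 - expected                 # observed score minus expected score
z = residual / SD_RESIDUAL               # standardized residual
percentile = 100 * NormalDist().cdf(z)   # standard normal percentile

print(round(expected, 1), round(residual, 1), round(z, 2))  # 24.5 -4.5 -0.68
print(round(percentile, 1))
```

With these rounded inputs the standard normal percentile comes out at roughly 25, slightly above the value of 23 given in the example; the difference may reflect rounding in the quoted coefficients, so the norm tables in the Appendix remain the reference.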
Norm tables are in the Appendix.

Discussion
The aim of this study was to establish demographically corrected pediatric norms for three tests measuring goal setting in school-aged children in Ukraine. Anderson (2002) defined goal setting as "the ability to develop new concepts, as well as the capacity to plan actions in advance and approach tasks in an efficient and strategic manner" (p. 74). We hence looked more specifically at planning and reasoning. The normal range of performance for Ukrainian children on these tasks was established by applying an auspicious method called continuous norming. This different approach to norming included simultaneously taking into consideration the impact of continuous and categorical demographic variables on test performance. Demographic variables included in this study are age, sex, and LPE (Van Breukelen and Vlaeyen 2005; Van der Elst et al. 2011). This method results in more suitable estimates of norm scores for reasons considered in "Introduction." The results of this study will be discussed next.
The current results show first of all that accuracy scores on the administered tests correlated in the expected directions. These correlations support the notion that reasoning and planning (including executing the planned steps to achieve a goal) occur in the same domain (EF domain), as hypothesized by Anderson (2002). On the other hand, the correlations are moderate to strong but not very strong (i.e., not > 0.90), which indicates differences in the constructs underlying the tests.
Second, we found age-related changes in test performance on all tests; i.e., we found that on average older children made fewer errors on the SWM and solved more problems on the SOC and the NNAT. This is in line with research by Pangelinan et al. (2011), who reported that age in typically developing children from 6 to 12.9 years old was associated with increasing accuracy on the SWM. Furthermore, the NNAT showed in principle curvilinear development with age, with stabilizing scores in the oldest children. The SOC and the SWM showed cubic development with age, i.e., some stabilizing in scores at lower and higher ages (Fig. 4). Development of working memory, a key component underlying planning as required for executing the SWM, SOC, and NNAT, reportedly has a (curvi)linear development with age from early childhood (Anderson 2002; De Luca et al. 2003; Gathercole et al. 2004; Luciana and Nelson 2002; Naglieri 2003). However, Crescentini et al. (2012) showed that planning also differs from memory (i.e., in the areas of brain activation involved) in a Tower of London task (comparable with the SOC). This is in line with Anderson (2002), who posed that planning and reasoning are related executive functions in the domain of goal setting, yet different from other functions such as information processing (which includes memory). Therefore, in principle, the increased accuracy found with increasing age may be associated with planning and reasoning (and thus goal setting), which was also shown in the relatively few studies on normative data in children on the SWM and SOC (Green et al. 2019; De Luca et al. 2003; Lehto et al. 2003; Luciana and Nelson 2002; Roque et al. 2011).
Third, the scores of the Ukrainian sample were compared with the Western norms to examine whether test scores overlapped (Rosselli and Ardila 2003). The comparison was limited to traditional 2-year age subgroups, because age was the only demographic variable available for both populations (Cambridge Cognition 2014; Naglieri 2003). The CIs of the grand mean for the SWM overlapped between the Ukrainian sample and the CANTAB norm sample, which was also, but to a lesser degree, the case for the SOC scores. This indicates a comparable performance in planning and goal setting in the Ukrainian and Western samples (Lehto et al. 2003). The NNAT, however, showed no overlap in CIs between the Ukrainian sample and the Western mean scores. Additionally, only 0.5% of the Ukrainian sample achieved the highest observed score on the NNAT, so there were no test ceiling effects. Taken together, the NNAT results appear to indicate a lower mean than in the Western norm population. Caution is warranted, as these comparisons were only based on descriptive data (CIs) and did not include the demographic variables sex and LPE, which would need to be included in future, statistically more rigorous analyses assessing cultural differences in test scores.
Fourth, we found neither sex effects nor an age by sex interaction on any of the tests. This is in line with Luciana and Nelson (1998), who did not find sex differences in the accuracy scores of the SOC and SWM. Sex differences were found in a study of the NNAT in a large sample, but Rojahn and Naglieri (2006) deemed the effect sizes negligible, and sex was subsequently not included in the standardized norms, which supports the findings here.
The last finding is an effect of the demographic variable of LPE on the NNAT, linking higher NNAT accuracy to higher LPE. LPE is associated with the socio-economic status of families, which in turn is linked to life outcomes such as cognitive development of children (Bradley and Corwyn 2002). Bronfenbrenner and Morris (2006) hypothesized that planning is an example of a cognitive ability that is subject to environmental influences, i.e., a child that experiences modeling and learning of planning from e.g. the adults in his/her life may be more likely to use it when a situation requires. Lohman and Gambrell (2012) however found that children of lower SES performed better on the NNAT. Lohman and Gambrell (2012) link these results to test design differences, i.e., describing the NNAT and the Raven's as figural reasoning tests as opposed to most other non-verbal reasoning tests. The latter are thought to use, among others, pictures and line drawings which may be associated with other constructs in testing (e.g., language).
There are some limitations in the current study that need to be addressed. First, the low LPE group was smaller than the high LPE group. However, in multiple linear regressions, each data point of the whole sample is used (i.e., the relation of the demographic variables to the dependent variable of the test score); thus, fewer data are needed to arrive at statistically valid results (Van Breukelen and Vlaeyen 2005). Similarly, these regressions allow for a smaller overall sample, which supports these analyses despite higher dropout rates in the SWM and SOC, which were in the last half of this test battery. Second, the number of regressions computed might increase the chance of type I errors. Although it cannot be stated with absolute certainty that some relationships are not chance findings (Piechatzek et al. 2009), the discussion shows that the findings in our study are supported by the (neuro)psychological literature. Furthermore, discussing the difference between short-term memory and working memory is not within the scope of this study (Faw 2003). Discussions here focus on the ability to keep information in memory for a short while and to manipulate that information according to task demands, which aligns with Anderson's (2002) conceptualization of working memory.

Conclusion
Demographically corrected norms were established for three tests measuring planning and reasoning (in the domain of goal setting) in children, i.e., the SWM, the SOC, and the NNAT. Some, but not all, of the demographic variables (age, sex, LPE) were associated with test scores in this sample: age was associated with scores on all three tests, LPE only with NNAT scores, and sex with none. Including demographic variables is therefore important when calculating norm scores (Mitrushina et al. 2005). Two tests showed differences in score trajectories compared with Western norms (i.e., cubic age effects), which supports our contention that differences between populations in test scores should be examined (in line with Rosselli and Ardila 2003). Providing norm tables and establishing the normal range for different populations, especially for functions that are difficult to evaluate (e.g., planning and reasoning and thus goal setting), is important for use in clinical and research settings.